Bioconductor: software for orchestrating high-throughput genomics data analysis


Sean Davis, MD, PhD
Center for Cancer Research
National Cancer Institute
@seandavis12
https://seandavi.github.io

December 4, 2018

Overview

Background and introduction

Two example workflows

  • Principle components analysis of public gene expression data
  • Somatic variant analysis of TCGA melanoma data

Additional functionality



These slides were written in R and all plots and results are generated from code in the slides–reproducible research in action!

https://github.com/seandavi/Talks

Introduction

img

But what is Bioconductor?

Bioconductor is a Toolkit for high-throughput biological data science built with reproducible computational research as a core value.

Bioconductor software supports all aspects the biological data science workflow.

  • Data import and export
  • Data management and accessibility
  • Data integration, including with public data
  • Analysis and statistics
  • High-level visualization
  • Efficiency and reproducibility, even at scale

Bioconductor Packages

The Bioconductor Ecosystem

Bioconductor: Education, Training, and Community

SummarizedExperiment

Example 1: Build a PCA plot from Public Data

Use the GEOquery package to fetch data about GSE103512.

library(GEOquery)
gse = getGEO("GSE103512", destdir='/tmp')[[1]]
gse
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54715 features, 280 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM2772660 GSM2772661 ... GSM2772939 (280 total)
##   varLabels: title geo_accession ... weight:ch1 (73 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 1007_PM_s_at 1053_PM_at ... AFFX-TrpnX-M_at (54715
##     total)
##   fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16
##     total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL13158

Example 1: Build a PCA plot from Public Data

Examine two variables of interest, cancer type and tumor/normal status.

library(xtable)
print(xtable(with(pData(gse),table(`cancer type:ch1`,`normal:ch1`))),type='html')
no yes
BC 65 10
CRC 57 12
NSCLC 60 9
PCA 60 7

Example 1: Build a PCA plot from Public Data

Information about features measured are also included.

Example 1: Build a PCA plot from Public Data

Filter gene expression by variance to find most informative genes.

sds = apply(exprs(gse),1,sd)
dat = exprs(gse)[order(sds,decreasing = TRUE)[1:500],]

Perform multidimensional scaling and prepare for plotting.

mdsvals = cmdscale(dist(t(dat)))
mdsvals = as.data.frame(mdsvals)
mdsvals$Type=factor(pData(gse)[,'cancer type:ch1'])
mdsvals$Normal = factor(pData(gse)[,'normal:ch1'])

And do the plot.

library(ggplot2)
ggplot(mdsvals, aes(x=V1,y=V2,shape=Normal,color=Type)) + 
    geom_point(size=4, alpha=0.6) + theme(text=element_text(size = 18))

Example 1: Build a PCA plot from Public Data

Example 2: Melanoma Variants from NCI Genomic Data Commons Data

Imgur

Example 2: Melanoma Variants from NCI Genomic Data Commons Data

Use the GenomicDataCommons package to find and download variants from the TCGA cutaneous melanoma dataset.

library(GenomicDataCommons)
fnames = files() %>% 
    GenomicDataCommons::filter(~ cases.project.project_id=='TCGA-SKCM' & 
        data_type=='Masked Somatic Mutation' &
        data_format=='MAF' &
        analysis.workflow_type=='MuTect2 Variant Aggregation and Masking') %>%
    ids() %>%
    gdcdata()

And now take those data directly to maftools for analysis and visualization.

library(maftools)
melanoma = read.maf(maf = fnames[1])

Example 2: Melanoma Variants from NCI Genomic Data Commons Data

Example 2: Melanoma Variants from NCI Genomic Data Commons Data

Example 2: Melanoma Variants from NCI Genomic Data Commons Data

v = oncodrive(maf = melanoma, AACol = 'HGVSp_Short', minMut = 5, pvalMethod = 'zscore')
plotOncodrive(v,labelSize=40,fdrCutOff = 0.1)

Example 2: Melanoma Variants from NCI Genomic Data Commons Data

Other functionality: Ranges

Imgur

Other functionality: Sequencing

Imgur

Data access and management

Hubs and data packages give users access to curated and versioned public data.

  • AnnotationHub
    • Reference genomes
    • Gene models
    • ENCODE data
    • UCSC Tracks
  • ExperimentHub
    • Curated and versioned public datasets (TCGA, 1000Genomes)
    • Searchable metadata
  • Experiment data packages
    • Easy sharing of data within lab, institute, or internationally

Reproducible research

  • Everything in bioconductor is version controlled and versioned.
  • Package build system guarantees compatibility between packages.
  • Documentation and workflows are “literate”, so if they do not work or do not match code, Bioc build system alerts developer.
  • Strong reliance on interoperability to reduce “reinvention” (and decrease mistakes)

Questions

Contact me:

Players: https://bioconductor.org/about/advisory-board/